\iffullversion
\else

\section{Voter Query Details}\label{sec:voterDetails}


\begin{figure*}[t!]\centering\footnotesize
	% \begin{figure*}[h]\centering
	\begin{tabular}{|l|| r | r |r |r|r|r||r|r | r |r |r|r|}
		\cline{2-13}
		\multicolumn{1}{c}{}         & \multicolumn{6}{|c||}{LAN}                                   & \multicolumn{6}{|c|}{WAN}                                    \\ \hline
		\multirow{2}{*}{Application} &                  \multicolumn{6}{c||}{$n$}                   &                   \multicolumn{6}{c|}{$n$}                   \\
		& $2^8$ & $2^{12}$ & $2^{16}$ & $2^{20}$ & $2^{24}$ & $2^{26}$ & $2^8$ & $2^{12}$ & $2^{16}$ & $2^{20}$ & $2^{24}$ & $2^{26}$ \\ \hline\hline
		Voter Intra-state            & 0.01  & 0.02     & 0.2      & 4.7      &    114.7 &  2,190.1 &   1.0 &      1.0 & 2.2      & 27.1     & 456.1    &  7,463.9 \\ \hline
		Voter Inter-state            & 0.01  & 0.02     & 0.3      & 7.0      &    134.8 &  2,546.4 &   1.6 &      1.6 & 4.0      & 45.4     & 747.7    & 12,284.1 \\ \hline
		Threat Log $N=2$          & 0.02  & 0.03     & 0.2      & 5.1      &    121.4 &    488.1 &   2.4 &      2.5 & 4.8      & 34.6     & 585.6    &  2,342.4 \\ \hline
		Threat Log $N=4$           & 0.05  & 0.09     & 0.9      & 17.9     &    388.4 &  1,553.9 &   6.6 &      6.8 & 13.1     & 108.7    & 1,739.2  &  6,956.8 \\ \hline
		Threat Log $N=8$          & 0.10  & 0.19     & 1.7      & 47.1     &  1,021.0 & 16,336.1 &  14.9 &     15.3 & 30.0     & 264.3    & 4,228.8  & 16,915.2 \\ \hline
	\end{tabular}
\vspace{-0.3cm}
	\caption{\label{fig:app} The running time in seconds for the Voter Registration and Threat Log applications. The input tables each contain $n$ rows.  }
	\vspace{-0.3cm}
\end{figure*}

Given the problem statement from \ref{sec:voter}, a naive solution is to construct a centralized database of all the registered voters and citizen records. It is then a relatively straightforward process to identify persons with inaccurate records, attempt to double register or are simply not register at all. However, the construction of such a centralized repository of information has long had strong opposition in the United States due to concerns of data privacy and  excessive government overreach. As a compromise many states have volunteered to join the Electronic Registration Information Center (ERIC)\cite{eric} which is a non-profit organization with the mission of assisting states to improve the accuracy of America’s voter rolls and increase access to voter registration for all eligible citizens. This organization acts as a semi-trusted third party which maintains a centralized database containing hashes of the relevant information, e.g. names, addresses, drivers license number and social security number. 
\fi

In particular, instead of storing this sensitive information in plaintext, all records are randomized using two cryptographically strong salted hash functions. Roughly speaking, before this sensitive information is sent to ERIC, each state is provided with the first salt value $salt_1$ and updates each value $v$ as $v := H(salt_1 || v)$. This hashed data is then sent to ERIC where the data is hashed a second time by ERIC which possesses the other salt value. The desired comparisons can then be applied to the hashed data inside ERIC's secure data center. When compared with existing alternative, this approach provides a moderate degree of protection. In particular, so long as the salt values remain inaccessible by the adversary, deanatomized any given record is likely non-trivial. However, a long series of works, e.g. \cite{deanon0,deanon1,deanon2,deanon3,deanon4}, have shown that a significant amount of information can be extracted with sophisticated statistical techniques. Moreover, should the adversary possess the salt values a straightforward dictionary attack can be applied.

We propose adding another layer of security with the deployment of our secure database join framework. In particular, two or more of the states and ERIC will participate in the MPC protocol. From here we consider two possible solutions. The first option is to maintain the existing repository but now have it secret shared between the computational parties. Alternatively, each state could be the long-term holder of their own data and the states perform all pairwise comparison amongst themselves. For reason of preferring the more distributed setting we further explore the pairwise comparison approach. 

The situation is further complicated by how this data is distributed within and between states. In the typical setting no single state organization has sufficient information to identify individuals which are incorrectly or double registered. For example, typical voter registration forms requires a name, home address and state ID/driver's license number. If two states compared this information there would be no reliable attribute for joining the two records. The name of the voter could be used but names are far from a unique identifier. The solution taken by ERIC is to first perform a join between a state's registered voters and their Department of Motor Vehicles (DMV) records, using the  state ID/driver's license number as the join-key. Since the DMV typically possesses an individual's Social Security Number (SSN), this can now be used as a unique identifier across all states. However, due to regulations within some states this join is only allowed to be performed on the hashed data or, presumably, on secret shared data.

In addition to identifying individuals that are double registered, the mission of ERIC is to generally improve the accuracy of all voter records. This includes identifying individuals that have moved and not yet registered in their new state or that have simply moved within a state and not updated their current address. In this case the joins between/within states should also include an indicator denoting that an individual has updated their address at a DMV which is different than the voter registration record. There are likely other scenarios which ERIC also identifies but we leave the exploration of them to future work.

Given the building blocks of \sectionref{sec:construction} it is a relatively straightforward task to perform the required joins. First a state performs a left join between their DMV data and the voter registration data. Within this join the addresses in the inner join are compared. In the event of a discrepancy, the date of when these addresses were obtained can be compared to identify the most up to date address. Moreover, the agency with the older address can be notified and initial a procedure to determine which, if any, of the addresses should be updated. 


Once this join is performed, each state holds a secret shared table of all their citizens that possess a state ID and their current registration status. Each pair of states can then run an inner join protocol using the social security number as the key. There are several cases that a result record can be in. First it is possible for a person to have a DMV record in two states and be registered in neither. The identity of these persons should not be revealed as this does not effect the voting process. The next case is that a person is registered in both states. We wish to reveal this group to both states so that the appropriate action can be taken. The final case that we are interested in is when a person is registered in state \emph{A} and has a newer DMV address in state \emph{B}. In this case we want to reveal the identity of the person to the state that they are registered to. This state can then contact the person to inquire whether they wish to switch their registration to the new state. 


This approach has additional advantages over the hashing technique of ERIC. First, all of the highly sensitive information such as a persons address, state ID number and SSN can still be hashed before being added to the database\footnote{The hashing originally performed by ERIC can be replaced with the randomized encoding protocol.}. However, now that the data is secret shared less sensitive information such as dates need not be hashed. This allows for the more expressive query described above which uses a numerical comparison. To achieve the the same functionality using the current ERIC approach these dates would have to be stored in plaintext which leaks significant information. In addition, when the ERIC approach performs these comparison the truth value for each party of the predicate is revealed. Our approach reveals no information about any intermediate value. 

\begin{figure}[ht]
	
	{
		\scriptsize
		\begin{align*}	
		stateA = \texttt{select }& DMV.name,\\
		& DMV.ID, \\
		& DMV.SSN, \\
		& DVM.date > Voter.date \ ?\\
		&\quad DMV.date : Voter.date \texttt{ as } date,\\
		& DVM.date > Voter.date \ ?\\
		&\quad  DMV.address : Voter.address \texttt{ as } address,\\
		& DVM.address \neq Voter.address \texttt{ as } mixedAddress, \\ 
		& Voter.name \neq \texttt{ NULL as } registered \\
		\texttt{ from } & DMV \texttt{ left join } Voter \\
		\texttt{ on } & DMV.ID = Voter.ID \\
		stateB = \texttt{select }&...\\
		resultA = \texttt{select } & stateA.SSN \\
		& stateA.address \texttt{ as } addressA\\
		& stateB.address \texttt{ as } addressB\\
		& stateA.registered \\
		& stateB.registered \\
		\texttt{ from } & stateA \texttt{ inner join } stateB \\
		\texttt{ on } & stateA.SSN = stateB.SSN\\
		\texttt{ where } & (stateA.date < stateB.date \texttt{ and } stateA.registered ) \\
		\texttt{ or } & (stateA.registered \qquad \qquad \ \, \texttt{ and }stateB.registered )\\
		resultB = \texttt{select } & ...
		\end{align*}
	}
	\caption{SQL styled join query for the ERIC voter registration application. \label{fig:voterQuery}}
\end{figure}

Once the parties construct the tables in \figureref{fig:voterQuery}, state \emph{A} can query the table $stateA$ to reveal all IDs and addresses where the $mixedAddress$ attribute is set to \texttt{true}. This reveals exactly the people who have conflicting addresses between that state's voter and DMV databases. When comparing voter registration data between states, state B should define $stateB$ in a symmetric manner as $stateA$. The table $resultA$ contains all of the records which are revealed to state \emph{A} and $resultB$, which is symmetrically defined, contains the results for state \emph{B}. We note that $resultA$ and $resultB$ can be constructed with only one join.


Both types of these queries can easily be performed in our secure framework. All of the conditional logic for the select and where clauses are implemented using a binary circuit immediately after the primary join protocol is performed. This has the effect that overhead of these operation is simply the size of the circuit which implements the logic times the number of potential rows contained in the output. 